Skip to content

metis: implement gRPC daemon server#1024

Open
YifeiZhuang wants to merge 12 commits intokubernetes:masterfrom
YifeiZhuang:impl-daemon-server
Open

metis: implement gRPC daemon server#1024
YifeiZhuang wants to merge 12 commits intokubernetes:masterfrom
YifeiZhuang:impl-daemon-server

Conversation

@YifeiZhuang
Copy link
Copy Markdown
Contributor

@YifeiZhuang YifeiZhuang commented Mar 27, 2026

Key changes include:

  • Implement standard adaptiveipam gRPC server (daemon_server.go) to listen for pod IP allocation requests over a Unix Domain Socket. And implement rpc AllocatePodIP and DeallocatePodIP.
    • Implemented retries for DB errors within daemon serer.
  • Implement new methods in the Store to interface with the SQLite DB, supporting idempotency for all:
    • AddCIDR: Add CIDR blocks and seeds individual IP addresses.
    • AllocateIPv4: Find available IP slots and flips is_allocated to true.
    • ReleaseIPByOwner: Releases pod IP addresses by owner identifiers and sets cooldown period timestamp.

Threading model:

  • Each RPC request can call store concurrently to optimize request latency. The DB transactions guarantees thread safet between concurrent requests. Existing WAL mode and busy_timeout supports high concurrent read/write operations without locking.

No implementation for IPv6 yet.

/label tide/merge-method-squash

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 27, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

This issue is currently awaiting triage.

If the repository mantainers determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 27, 2026
@YifeiZhuang
Copy link
Copy Markdown
Contributor Author

/test

@YifeiZhuang
Copy link
Copy Markdown
Contributor Author

/test all

@YifeiZhuang YifeiZhuang marked this pull request as ready for review April 4, 2026 00:03
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 4, 2026
@k8s-ci-robot k8s-ci-robot requested a review from seans3 April 4, 2026 00:03
@YifeiZhuang
Copy link
Copy Markdown
Contributor Author

@arvindbr8 @gnossen @zhaoqsh PTAL

Copy link
Copy Markdown
Contributor

@arvindbr8 arvindbr8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @YifeiZhuang for taking care of this. I have done an initial pass and had a few comments.

t.Fatalf("AddCIDR failed: %v", err)
}

const numGoroutines = 50 // High contention
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we capped MaxOpenConns at 10 on the DB pool, won't Go just throttle these 50 goroutines? I'm wondering if this will actually test 50 concurrent SQLite operations, or if it just tests 10 at a time while the other 40 sit in Go's connection queue.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first 10 goroutines will get initial connections. For the remaining they will be waiting for an until other execution is paused. It will not throttle or error. It looks to me the right number of goroutines that we can test for. I've verified the sqlite waiting stats.

wg.Wait()
// Retrieve DB stats
stats := s.DB().Stats()
t.Logf("DB Stats after high contention: %+v", stats)
// Verify that at least some goroutines had to wait for a connection
if stats.WaitCount == 0 {
    t.Errorf("Expected some goroutines to wait for connections, but WaitCount was 0")
} else {
    t.Logf("Success: %d requests had to wait for a connection.", stats.WaitCount)
}

@arvindbr8
Copy link
Copy Markdown
Contributor

pull-cloud-provider-gcp-metis-glibc-floor-test — Job failed.

you will likely need to rebase to master to pull in changes from #1036

@arvindbr8
Copy link
Copy Markdown
Contributor

@YifeiZhuang a go mod tidy in our directory should get the glibc floor test passing

@YifeiZhuang
Copy link
Copy Markdown
Contributor Author

@YifeiZhuang a go mod tidy in our directory should get the glibc floor test passing

I added to the tool: #1038, no wonder it does not fail verify-up-to-date.

Copy link
Copy Markdown
Contributor

@arvindbr8 arvindbr8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. thanks Ivy, this is great!

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: arvindbr8, YifeiZhuang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

@YifeiZhuang: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
cloud-provider-gcp-e2e-full 5bcae85 link true /test cloud-provider-gcp-e2e-full

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants